This study addresses the question of how simple networks can account for a variety
of phenomena associated with the shift of a specialized processing focus across the
visual scene. We address in particular aspects of the dichotomy between the preattentive
parallel and the attentive serial modes of visual perception and their hypothetical neuronal
implementations. Specifically, we propose the following:

(1)  A  number  of  elementary  features,  such  as  color,  orientation,  direction  of  movement, 
disparity  etc.  are  represented  in  parallel  in  different  topographical  maps,  called  the  early 
representation. 

(2)  There  exists  a  selective  mapping  from  this  early  representation  into  a  more  central 
representation,  such  that  at  any  instant  the  central  representation  contains  the  properties 
of  only  a  single  location  in  the  visual  scene,  the  selected  location. 

(3)  We  discuss  some  selection  rules  that  determine  which  location  will  be  mapped 
into  the  central  representation.  The  major  rule,  using  the  saliency  or  conspicuity  of 
locations  in  the  early  representation,  is  implemented  using  a  so-called  Winner-Take-All 
network.  A  hierarchical  pyramid-like  architecture  is  proposed  for  this  network.  We  suggest 
possible  implementations  in  neuronal  hardware,  including  a  possible  role  for  the  extensive 
back-projection from the cortex to the LGN.

1. Introduction


A number of psychophysical studies concerning the detection, localization and inspection
of  objects  in  the  visual  field  have  suggested  a  two-stage  theory  of  human  visual  perception. 
The  first  stage  is  the  "preattentive"  mode,  in  which  simple  features  are  processed  rapidly 
and  in  parallel  over  the  entire  visual  field.  In  the  second,  "attentive"  mode,  a  specialized 
processing  focus,  usually  called  the  focus  of  attention,  is  directed  to  particular  locations  in 
the  visual  field.  The  analysis  of  complex  forms  and  the  recognition  of  objects  are  associated 
with  this  second  stage  (Neisser,  1967;  Bergen  &  Julesz,  1983;  Treisman,  1983;  Ullman, 
1983;  Julesz,  1984).  The  computational  justification  for  such  a  hypothesis  comes  from 
the  realization  that  while  it  is  possible  to  imagine  specific  algorithms  performing  specific 
tasks  such  as  inspection,  counting,  marking  etc.  at  specific  locations,  it  is  difficult  to 
imagine  these  algorithms  operating  in  parallel  over  the  whole  visual  scene,  since  such  an 
approach  will  quickly  lead  to  a  combinatorial  explosion  in  terms  of  required  computational 
resources  (Ullman,  1983;  Poggio,  1984).  This  is  essentially  the  major  critique  of  Minsky 
and  Papert  to  a  universal  application  of  perceptrons  in  visual  perception  (Minsky  &  Papert, 
1969).  Taken  together,  these  empirical  and  theoretical  studies  suggest  that  beyond  a 
certain  preprocessing  stage,  the  analysis  of  visual  information  proceeds  in  a  sequence  of 
operations,  each  one  applied  to  a  selected  location  (or  locations). 

The  sequential  application  of  operations  to  selected  locations  raises  two  central  problems. 
First,  what  are  the  operations  that  the  visual  system  can  apply  to  the  selected  locations? 
Second,  how  does  the  selection  proceed?  That  is,  what  determines  the  next  location  to  be 
processed,  and  how  does  the  processing  shift  from  the  current  to  the  next  selected  location? 
In this paper we consider primarily the second of these questions. With respect to the first
question, we only suggest that one of the fundamental operations is what we term "selective
mapping". According to this view, the early "preattentive" representations describe the
visual  scene  in  terms  of  a  number  of  elementary  properties  such  as  color,  orientation,  depth 
and  movement  (Treisman,  1983;  Julesz  &  Bergen,  1983).  When  a  location  is  selected,  its 
properties  are  mapped  from  the  early  representations  into  a  higher,  central  representation. 
This  central  representation  consequently  contains  the  properties  of  only  this  single  selected 
location. With respect to the second question, we discuss a number of "selection rules"
that determine the next location to be processed. It is obviously desirable that the selective
mapping does not occur at random, but is applied to "interesting" locations. But how
does one define what "interesting" means, without having recourse to higher, symbolic
concepts? We will propose a specific set of "selection rules" that determine which location
will  be  mapped  into  the  central  representation  at  any  given  time.  The  major  rule  for  the 
initial selection of a location is based on the conspicuity of that location, i.e. on how much
its properties differ from the properties of its neighborhood. Two rules for shifting from one
selected location to another are based on (i) proximity and (ii) similarity with the presently
selected location. We will propose and discuss simple neuron-like networks that implement
the  selective  mapping  and  the  selection  rules.  Formulating  the  mechanism  related  to 
selective  visual  attention  in  terms  of  a  schematic  network  where  the  individual  components 
perform  simple  local  operations,  rather  than  in  the  language  of  higher  cognitive  concepts, 
has  the  advantage  that  specific  predictions  concerning  the  anatomy  and  electrophysiology 
of  the  specialized  cortical  regions  involved  in  attention  can  be  derived.  The  main  point 
we wish to make is not so much that the particular network we propose is necessarily
implemented in the brain, but that the shift of selective visual attention and related visual
operations can be explained and modeled using simple mechanisms compatible with cortical
physiology and anatomy.


2.  The  Early  Representation 


According  to  our  suggestion,  selective  visual  attention  operates  on  what  we  call  the  early 
representation,  a  set  of  topographical,  cortical  maps  encoding  the  visual  environment 
(Barlow,  1981).  The  early  representation  includes  a  variety  of  different  maps  for  different 
elementary  features  such  as  orientation  of  edges,  color,  disparity  or  direction  of  movement 
(see  figure  1).  Neighborhood  relations  are  preserved  in  these  maps,  i.e.  nearby  locations 
in  the  visual  scene  project  to  nearby  locations  in  the  map.  Local,  inhibitory  connections, 
mediating  lateral  inhibition,  occur  either  at  an  earlier  stage  or  within  the  feature  maps. 
Thus,  locations  that  differ  significantly  from  their  surrounding  locations  are  singled  out 
at  this  level.  The  state  of  each  of  these  maps  signals  how  conspicuous  a  given  location 
in  the  visual  scene  is:  a  red  blob  surrounded  by  similar  red  blobs  will  certainly  be  less 
conspicuous  than  a  red  blob  surrounded  by  green  blobs.  It  should  be  emphasized  that 
the different maps do not necessarily have to be in physically different locations, but may
be intermixed. Moreover, these maps may possibly exist at different scales, i.e. at different
spatial resolutions, in accordance with the evidence for multiple spatial channels (e.g.
Wilson & Bergen, 1979).

In  addition  to  the  maps  for  the  different  features,  we  assume  the  existence  of  another 
topographical  map,  termed  the  saliency  map,  which  combines  the  information  of  each 
individual  map  into  one  global  measure  of  conspicuity.  Points  in  the  elementary  feature 
maps,  corresponding  to  one  location  in  the  visual  scene,  project  onto  a  unit  in  the  saliency 
map.  The  saliency  map  gives  a  "biased"  view  of  the  visual  environment,  emphasizing 
interesting  or  conspicuous  locations  in  the  visual  field.  Since  the  saliency  map  is  still  a 
part  of  the  early  visual  system,  it  most  likely  encodes  the  conspicuity  of  objects  in  terms 
of simple properties such as color, direction of motion and orientation. Saliency at a given
location is determined primarily by how different this location is from its surround in color,
orientation,  motion,  etc.  It  is  possible,  however,  that  the  relative  weight  of  the  different 
properties  contributing  to  this  representation  can  be  modulated  by  the  activity  of  some 
higher  cortical  centers. 
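
The combination of feature maps into a saliency map can be illustrated with a small sketch. It is not the authors' implementation: the functions `conspicuity` and `saliency_map`, the 8-neighbourhood surround, the absolute-difference measure and the uniform default weights are all assumptions chosen for illustration. The `weights` argument stands in for the possible modulation by higher cortical centers.

```python
def conspicuity(feature_map):
    """Absolute difference between each cell and the mean of its neighbours.

    Assumption: the surround is the (up to) 8 adjacent cells of the grid.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                feature_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            mean = sum(neighbours) / len(neighbours)
            out[r][c] = abs(feature_map[r][c] - mean)
    return out


def saliency_map(feature_maps, weights=None):
    """Weighted sum of the conspicuity maps of all feature maps."""
    weights = weights or [1.0] * len(feature_maps)
    maps = [conspicuity(fm) for fm in feature_maps]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, maps)) for c in range(cols)]
        for r in range(rows)
    ]


# One "red" blob (1.0) among "green" blobs (0.0); orientation is uniform.
color = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
orientation = [[0.5] * 3 for _ in range(3)]
saliency = saliency_map([color, orientation])
```

In this toy scene the uniform orientation map contributes nothing, and the odd-colored blob in the center receives the highest saliency.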


One  general  problem  that  must  be  solved  at  this  level  is  aligning  the  feature  maps  with 
respect to each other, i.e. solving the spatial register problem. Combining the information
of the different feature maps, or retrieving information relevant to a single location, requires
a  fast  and  reliable  pathway  to  address  the  same  location  in  the  different  maps.  We  suggest 
that  this  register  is  obtained  as  part  of  the  selective  mapping  process,  described  next. 


3.  Selective  Mapping 

We  assume  that  in  addition  to  the  early  representation,  the  properties  of  objects  can  also  be 
represented in a second, "central" representation. With respect to the connections between
the  early  and  the  central  representation,  we  make  the  simple  assumption  that  all  the  units 
in  one  feature  map  project  to  a  single  unit  (not  necessarily  a  single  cell)  in  the  central 
representation.  For  example,  all  units  that  detect  the  presence  of  a  vertical  orientation 
anywhere  in  the  visual  field  are  connected  to  a  central  "verticality  detector".  If  the  central 
unit  is  active,  it  can  be  inferred  that  there  is  at  least  one  object  with  the  specific  feature 
somewhere  present  in  the  visual  field.  The  mapping  from  the  individual  feature  map  to  the 
feature  detector  cell  will  not  preserve  the  location.  Retrieving  location,  except  perhaps  in  a 
rough  manner  (e.g.  there  is  a  red  object  somewhere  in  the  lower  left  hemisphere),  cannot 
be  done  at  this  parallel  stage.  The  central  representation  thus  signals  the  presence  of  at 
least  one  instance  of  a  given  property,  such  as  being  red  or  horizontal.  Without  the  use  of 
selective  attention  it  has  no  way  of  "knowing",  however,  whether  different  attributes  belong 
to  a  single  object  (which  is  both  red  and  horizontal),  or  to  different  objects  in  the  visual 
field.  Finally,  if  the  projection  from  the  early  maps  to  the  feature  detector  can  distinguish 
between  one,  two,  three  and  four  or  more  active  units,  it  would  constitute  a  sort  of  parallel 
counting  device,  alleviating  the  need  for  any  perceptron-like  network  of  the  kind  explored 
by  Minsky  and  Papert  (1969)  to  compute  the  predicate:  "the  visual  scene  contains  exactly 
t  points".  Such  a  mechanism  may  explain  the  ease  and  proficiency  with  which  human 
observers  can  count  the  number  of  up  to  four  or  five  objects  present  in  the  visual  field 
(Atkinson,  Campbell  &  Francis  1976). 

Consider  next  the  problem  of  performing  a  conjunctive  search  task  (Treisman  &  Gelade, 
1980),  for  instance,  searching  for  a  target  line  segment  that  is  both  vertical  and  green  in 
the  presence  of  red  and  green  lines  that  are  either  horizontal  or  vertical.  The  mechanisms 
considered  so  far  are  insufficient  for  this  task:  the  central  representation  will  signal  the 
presence  of  red,  green,  vertical,  and  horizontal  objects,  but  will  not  represent  explicitly  the 
combination  of  these  properties.  For  computing  conjunctions  we  postulate  a  "switch"  that 
routes  the  properties  of  a  single  location,  the  selected  or  attended  location,  into  the  central 
representation,  which  will  now  only  contain  information  relevant  to  the  selected  location. 
Note  that  the  computations  required  to  abstract  properties  from  the  information  contained 
in the visual input map are performed within the early representation, i.e. prior to the
selection process, and not subsequent to it. This distinction is important, for example, in the
computation  of  color.  As  has  been  demonstrated  psychophysically  (e.g.  Land,  1959),  the 
computation  underlying  color  perception  is  a  global  process,  that  requires  the  entire  visual 
field  (or  a  large  portion  of  it).  It  is  therefore  reasonable  to  assume  that  the  computation  of 
color  and  other  properties  proceeds  within  the  early  representation,  prior  to  the  selection 
of  a  location  for  further  processing. 
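
The difference between the pooled central detectors and the selective "switch" can be made concrete with a short sketch. The scene encoding, the function names and the index-order inspection are assumptions for illustration only; in the model proper, the order of inspection is set by the selection rules rather than by list position.

```python
# Each location carries elementary features from the early representation.
scene = [
    {"color": "red", "orientation": "vertical"},
    {"color": "green", "orientation": "horizontal"},
    {"color": "red", "orientation": "horizontal"},
]


def parallel_detectors(scene):
    """Central units pooled over all locations: presence only, no location."""
    return {
        "colors": {item["color"] for item in scene},
        "orientations": {item["orientation"] for item in scene},
    }


def selected_properties(scene, selected):
    """The 'switch': only the selected location reaches the central units."""
    return dict(scene[selected])


def conjunction_search(scene, color, orientation):
    """Serially inspect locations; return the target index or None."""
    for index in range(len(scene)):
        central = selected_properties(scene, index)
        if central["color"] == color and central["orientation"] == orientation:
            return index
    return None


pooled = parallel_detectors(scene)
```

Here the pooled detectors report both "green" and "vertical", although no single item is green and vertical; only the serial inspection of selected locations settles the conjunction.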

3.1  Selective  Mapping:  The  Basic  Mechanism 

The  operations  underlying  the  selective  routing  of  information  from  the  early  representation 
to  the  central  one  can  be  performed  by  two  complementary  cellular  networks  (see  also  figure 
5). One such network, called the Winner-Take-All network (WTA network; see Feldman,
1982, who introduced this term, and Feldman & Ballard, 1982), localizes the most active unit in
the  saliency  map  while  the  second  network  relays  the  properties  of  the  selected  location 
to  the  central  representation.  It  is  here  that  specialized  visual  routines  for  the  extraction 
of  shape  and  form  can  be  applied  to  the  properties  of  the  selected  location.  At  any  given 
time  only  one  location  is  selected  from  the  early  representation  and  copied  into  the  central 
representation. The WTA network, equivalent to a maximum-finding operator, operates on
the outputs x_i of the units in the saliency map. In a neural network, x_i can be interpreted as
the electrical activity (intracellular voltage or spiking rate) of the unit at location i. The WTA
mechanism maps this set of input units onto an equal number of output units, described
by y_i, using the transformation rule:

$$
y_i =
\begin{cases}
0 & \text{if } x_i < \max_j \{x_j\} \\
f(x_i) & \text{if } x_i = \max_j \{x_j\}
\end{cases}
\qquad (1)
$$

where f is any increasing function of x_i (including a constant). All output units are set to
zero except the one corresponding to the most active input unit.
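
As a point of reference, the transformation rule (1) can be written directly as a function. This is only a statement of the input-output behavior, not of the network that computes it; the name `winner_take_all` and the constant default for f are assumptions.

```python
def winner_take_all(x, f=lambda value: 1.0):
    """Rule (1): y_i = f(x_i) for the maximal input, 0 for all other inputs.

    If several inputs share the maximum, every one of them passes the test
    x_i == max(x), so several outputs stay active.
    """
    peak = max(x)
    return [f(xi) if xi == peak else 0.0 for xi in x]
```

For example, `winner_take_all([0.2, 0.9, 0.4])` leaves only the second output active, and passing `f=lambda value: value` preserves the winner's activity instead of replacing it by a constant.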


3.2  The  Winner-Take-All  Network 

Building  a  WTA  network  may  appear  as  a  straightforward  task,  but  complications  arise 
when  the  intrinsic  properties  of  biological  hardware  are  taken  into  account.  Depending  on 
the  underlying  hardware,  two  extremes  for  computing  the  maximum  of  a  given  set  can 
be  envisioned.  On  a  serial  machine,  the  simplest  algorithm  is  a  sequential  search  for  the 
largest number through the entire input set. The drawback to this method is that for n
discrete inputs, n basic time steps are required (by basic time step we always refer to
the time required to execute an elementary operation such as comparing two numbers). A
highly parallel machine with n processors, each one having direct access to the other n - 1
processors, can compute the maximum in one time step by comparing simultaneously the
value of each processor with the values of all the other processors.[1]

Figure 2. A schematic drawing illustrating the Winner-Take-All (WTA) network computing the
maximum x_k of a set of n input units in the saliency map. It localizes the most conspicuous point by
a number of parallel operations and activates the corresponding output line (in this case unit y_k)
after at most 2 log_m n time steps (if m units can be compared simultaneously).

A  simple  implementation  one  may  suggest  for  the  WTA  network  is  a  mutual  inhibitory 
network  of  the  type  studied  by  Hadeler  (1974),  where  every  unit  inhibits  every  other  unit. 
In  these  networks,  neurons  are  assumed  to  be  linear  summation  devices,  followed  by  a 
threshold operation (see for instance McCulloch & Pitts, 1943). They can be described by

$$
x_i^{t+1} = f\Big(I_i + \sum_j w_{ij}\, x_j^{t}\Big), \qquad (2)
$$

where x_i^{t+1} is the output of unit i at time t + 1, w_ij is the synaptic weight between the
i-th and the j-th cell, I_i is the input to the i-th cell and f(x) is zero for all x below a positive
threshold value and a monotonically increasing function of x above this threshold. Such a mutual
inhibitory  network  will  be  unable,  however,  to  implement  the  WTA  computation.  The  reason 
is the following. The requirement that for any set of inputs I_i only a single output survives
implies w_ij < -1 for every i, j. If this condition is met, then for many input sets the network
described by equation (2) will oscillate and fail to converge. Convergence is only guaranteed
if the largest input is larger than the sum of the other inputs, I_i > Σ_{j≠i} I_j. Otherwise, the
network will oscillate. Thus, in practice, these networks will fail for more than 2 units. We
conjecture that this undesirable property is unavoidable in any cellular network where the
individual components have only access to the summed activity of the converging cells, i.e.
Σ_j x_j. A possible remedy for this problem is the introduction of an amplitude-dependent
time-course of x_i. If, for instance, the unit with the largest output responded faster than
units with smaller output, it could inhibit its competitors before it would be inhibited by
them, thus avoiding oscillations. In this manner it is possible to combine more than two
units, but it would still be difficult to construct a network of this type with a large number of
elements.

[1] This is essentially the mechanism Feldman and Ballard (1982) propose for their implementation of
a WTA network.
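
The oscillation of a mutual inhibitory network can be reproduced with a few lines of simulation. The sketch below makes several assumptions that are not in the text: all inhibitory weights are equal (-1.5), there is no self-inhibition, the threshold is 0.1, f is the identity above threshold, updates are synchronous, and the network starts from silence.

```python
def step(x, inputs, weight, theta):
    """One synchronous update of the mutual inhibitory network."""
    total = sum(x)
    new_x = []
    for xi, input_i in zip(x, inputs):
        # Each unit is inhibited by the summed activity of all other units.
        drive = input_i + weight * (total - xi)
        new_x.append(drive if drive > theta else 0.0)
    return new_x


def simulate(inputs, weight=-1.5, theta=0.1, steps=20):
    """Return the list of network states, starting from a silent network."""
    states = [[0.0] * len(inputs)]
    for _ in range(steps):
        states.append(step(states[-1], inputs, weight, theta))
    return states


def has_converged(states):
    """True if the last two states are identical."""
    return states[-1] == states[-2]


dominant = simulate([5, 1, 1])    # largest input exceeds the sum of the others
contested = simulate([3, 2, 2])   # largest input is below the sum of the others
```

Under these assumptions, `dominant` settles on a state in which only the first unit is active, whereas `contested` alternates between all units active and all units silent.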

In  the  previous  discussion  neurons  were  assumed  to  be  simple,  linear  threshold  devices. 
It  has  been,  however,  long  realized  that  neurons  are  complicated  computational  machines 
performing  a  variety  of  logical  operations  on  their  input  (e.g.  Schmitt,  Dev  &  Smith, 
1976).  Even  taking  account  of  these  more  realistic  neurons,  it  is  still  difficult  to  envisage 
an  implementation  of  the  WTA  computation  by  a  single  uniform  network.  Moreover,  the 
requirement  that  each  unit  in  the  network  is  connected  to  every  other  unit  seems  prohibitive 
in  terms  of  numbers  and  the  non-locality  of  the  connections. 

We  therefore  propose  a  different  cellular  mechanism,  based  on  the  following  two  assumptions. 

(1)  Except  for  some  long-range  excitatory  connections,  most  connections,  whether  excitatory 
or  inhibitory,  are  local. 

(2)  Each  elementary  processing  unit  only  performs  some  simple  well-specified  operation, 
such  as  addition  or  multiplication.  In  particular,  the  basic  processing  units  are  unable  to 
use  any  symbolic  information,  such  as  addresses. 

The basic version of the WTA network consists of two intercalated pyramid-like structures.[2]
The network operates in a highly parallel fashion by computing the maximum of a small
number m of units across the whole input set. Next, comparisons are made among these
local maxima to compute again the most active unit. These comparisons are repeated
k = log_m n times until the global maximum has been determined. Figure 3 shows
one particular implementation of the WTA network with m = 2. The more active unit inhibits
the less active unit and transmits its activity onto the next higher level. Here, among n/2
units, the process is repeated.[3] Under the assumption that the connections between the
levels transmit faithfully the activity of the units, the top-most unit in the pyramid will hold
the activity x_i of the global maximum after k time steps. However, it is the location of the
maximum and not its absolute value which is of relevance for the selection process.

[2] Hierarchical, pyramid-like computer architectures have been proposed for image processing and
analysis. For an overview of their use see (Rosenfeld, 1984).

[3] The computational structure is similar to the Wimbledon tennis tournament where players drop out
if they lose a single match (a so-called knock-out competition).

The
location  of  the  corresponding  unit  in  the  saliency  map  can  be  obtained  by  the  use  of  the 
second  pyramid,  having  a  reversed  flow  of  information  with  respect  to  the  first  pyramid. 
It  "marks"  the  path  of  the  most  active  unit  through  the  first  pyramid,  activating  finally  the 
output y_i of the WTA. This is done with the help of an auxiliary unit associated with every
unit  in  the  first  pyramid  (called  the  main  unit).  The  auxiliary  unit  is  only  activated  if  it 
receives  conjoint  excitation  from  its  main  unit  and  from  the  auxiliary  unit  at  the  next  higher 
level.  Since  at  every  level  the  most  active  (main)  unit  in  a  local  comparison  suppresses  the 
activity of the other m - 1 (main) units, the associated auxiliary units as well as all auxiliary
units in the subtree below them can never be activated. After another k time steps, the
output y_i, corresponding to the most active unit in the saliency map, will be activated, while
the  rest  of  the  output  units  remain  silent.  Except  for  the  pathological  case  when  two  or 
more  inputs  are  exactly  equal,  the  WTA  network  will  always  converge  to  a  unique  solution 
within at most 2 log_m n time steps. It can be built with no more than 2nm/(m - 1) units. This is
immediately established by adding up the total number of units at each of the k = log_m n
levels:

$$
n + \frac{n}{m} + \frac{n}{m^2} + \cdots + \frac{n}{m^k}.
$$

This expression is smaller than the infinite geometric series

$$
n \left( 1 + m^{-1} + m^{-2} + \cdots + m^{-k} + m^{-k-1} + \cdots \right),
$$

since the terms of the order of m^{-k-1} and higher vanish. This series converges to nm/(m - 1)
if m > 1. The factor 2 takes both pyramids into account. Notice that for all integers m,
the WTA network can always be built with fewer than 4n units. Assuming that the optic
nerve contains approximately 10^6 fibers and that m = 10 neurons can compare their activity
simultaneously, a WTA network covering the entire retinal image would require no more
than 2.2 · 10^6 neurons, a small fraction of all visual neurons. If only the Y-system, with
its  associated  short  delay,  high  movement  sensitivity,  large  receptive  fields  and  transient 
temporal  response,  provided  the  major  input  to  the  WTA  network,  this  number  would  drop 
substantially.  In  the  cat,  about  1%  of  all  ganglion  cells  are  of  the  Y-type.  If  this  percentage 
carries  over  to  primates  and  man,  a  WTA  network  for  the  entire  visual  field  could  be  built 
with just 10^4 neurons. Interestingly, the computational architecture of the WTA network is
reminiscent of the K- and P-pyramids proposed by Minsky for his K-line theory of memory
(Minsky,  1979). 
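
The two-pyramid scheme of this section can be sketched as an upward pass that keeps local maxima and a downward pass that retraces the winning path. This is a sequential simulation of what the network is assumed to do in parallel; the function names, the list-based bookkeeping and the tie-breaking in favour of the first unit are illustrative assumptions.

```python
def pyramid_wta(x, m=2):
    """Locate the maximum of x with an upward and a downward pass.

    Returns (winner_index, time_steps), counting one time step per level
    in each direction.
    """
    # Upward pass (first pyramid): every group of m units passes its local
    # maximum to the next level and records which member of the group won.
    level = list(x)
    winners_per_level = []
    while len(level) > 1:
        next_level, winners = [], []
        for start in range(0, len(level), m):
            group = level[start:start + m]
            local = max(range(len(group)), key=group.__getitem__)
            winners.append(start + local)
            next_level.append(group[local])
        winners_per_level.append(winners)
        level = next_level

    # Downward pass (second pyramid): follow the marked path from the top
    # unit back to the input level.
    index = 0
    for winners in reversed(winners_per_level):
        index = winners[index]
    return index, 2 * len(winners_per_level)


def unit_bound(n, m):
    """Upper bound 2 n m / (m - 1) on the number of units in both pyramids."""
    return 2 * n * m / (m - 1)


winner, steps = pyramid_wta([3, 7, 1, 9, 4, 2, 8, 5], m=2)
```

With n = 8 inputs and m = 2, the winner is found after three upward and three downward steps; `unit_bound(10**6, 10)` gives roughly 2.2 million units, in line with the estimate above.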


3.3 Mapping the Selected Location into the Central Representation

Figure 3. A possible implementation of the Winner-Take-All network with n = 8 input units. The
local comparison takes place between m = 2 units. The more active unit inhibits the less active
one and excites the unit on the next level. The auxiliary units, drawn in black, are only activated if
they receive conjoint excitation from their associated main unit and from the auxiliary unit at the
higher level. The auxiliary unit y_i, corresponding to the most active unit x_i in the saliency map, will
be activated after at most 2 log_m n = 6 time steps. In order to ensure stability against noise and to
enforce neighborhood relations between all neighboring points (for instance between the two middle
units, belonging to two different subtrees) additional connections (and units) can be added between
(and within) levels. We have just shown the most sparse implementation of a WTA network.


Once  the  most  conspicuous  point  has  been  localized  in  the  saliency  map,  its  properties,  i.e. 
the  information  contained  within  the  early  representation,  must  be  copied  into  the  central 
representation.  The  routing  of  this  information  can  be  achieved  by  removing  some  tonic 
inhibitory  influence  or  by  increasing  the  amount  of  excitation  at  the  selected  location  in 
the  early  representation.  We  will  not  suggest  here  specific  mechanisms  for  the  mapping 
operation.  The  crucial  point  is  that  the  WTA  network  directs  the  "copy"  operation  to  a 
single  selected  location  (figure  2).  Note,  that  the  selection  system  itself  is  not  responsible 
for  the  information  processing  relevant  to  the  visual  task  but  simply  selects  which  area  of 
visual  space  should  be  inspected  (Posner,  Snyder  &  Davidson,  1980).  It  can  be  likened  to  a 
spotlight  illuminating  some  portion  of  the  visual  field.  This  view  of  attention  is  in  accordance 
with  the  fact  that  it  is  not  possible  for  visual  attention  to  be  allocated  simultaneously  to  two 
different positions in space (Posner et al., 1980).

Figure 4. Shifting visual attention within the saliency map. Once the most conspicuous location
(point 1) has been detected and examined, its corresponding output x_1 decays and the WTA
mechanism shifts to the next most salient location, 2. The time needed to find the next location
increases with increasing distance between locations 1 and 2.

3.4  Shifting  the  Processing  Focus 

Until now we have only considered the initial selection of an "interesting" location.
But how does the selection process move from one location to the next, i.e. how can
selective attention shift across the visual field (Shulman, Remington & McLean, 1979)?
From psychophysical experiments it is known that it takes some measurable time to shift
the focus of attention from one location to another (Eriksen & Schultz, 1977; Tsal, 1983).
There is some evidence that this time increases with the distance between these locations
(Shulman et al., 1979; Tsal, 1983; see, however, Remington & Pierce, 1984).

A simple way to introduce such dynamics into our model is to let the conspicuity of
the maximally active unit in the saliency map decay, even if constant visual stimuli are
present. This decay may be implemented either locally or centrally (or by some combination
of the two methods). By "local" we mean that an active location in the saliency map
adapts and decays after a while. By "central" we mean that once the information from
the early representation has been relayed to the central representation a signal is sent
back, inhibiting the most active unit in the saliency map, i.e. its conspicuity fades. The
WTA network responds to the new input configuration by shifting away from the presently
selected location and towards the next most conspicuous location. The convergence time,
i.e. the time taken by the WTA network to converge to the newly selected location, depends
primarily on the distance between the two locations. In the worst case it will take 2 log_m n
time steps for the new maximum to propagate up, and subsequently down, the log_m n layers
(see figure 4), assuming that the comparison of m units can be done in one time step.
Shorter convergence time can be achieved if the two locations are close to each other.
Note,  that  the  dependency  of  the  convergence  time  on  distance  follows  naturally  from  the 
computational  architecture  of  the  WTA  network  and  does  not  have  to  be  artificially  imposed. 
In our previous example of a WTA network with n = 10^6 and m = 10, a solution will always
be  found  after  at  most  12  time  steps.  Since  time-constants  for  neurons  are  in  the  msec 
range,  this  number  seems  broadly  compatible  with  the  estimated  30  -  50  msec  required  to 
shift  visual  attention  to  a  new  location  (Bergen  &  Julesz,  1983).  After  a  new  location  has 
been  selected,  the  visual  information  associated  with  this  location  is  routed  to  the  central 
representation.  The  local  scheme  is  similar,  except  that  the  most  active  unit  is  locally 
inhibited,  for  instance  at  some  fixed  time  after  the  WTA  mechanism  has  converged.  These 
schemes are not mutually exclusive; in fact, it seems likely that some local, automatic mechanism
might always be in operation. The cortical mechanism is only invoked when a voluntary
shift of attention is desired (Posner, 1980). The basis for both mechanisms is a long-lasting
inhibition of the selected unit in the saliency map, which prevents the attentional focus
from revisiting this location for a given time period. A temporary inhibition, lasting more than 500
ms, has been reported by Posner, Cohen and Rafal (1982) after attentional shifts away from
a cued location. Processing efficiency appears to be reduced at locations in the visual
field once attention is withdrawn.
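
The shifting scheme, selection followed by a lasting inhibition of the winner, can be sketched as a loop. The function names, the fixed inhibition of 1.0 (which assumes saliency values between 0 and 1) and the flat list of locations are illustrative assumptions; the sketch ignores the dependence of convergence time on distance.

```python
def scan_path(saliency, inhibition=1.0, shifts=3):
    """Visit locations in order of decreasing saliency.

    After each selection the winner is suppressed, so the next
    winner-take-all pass converges on a different location.
    """
    remaining = list(saliency)
    path = []
    for _ in range(shifts):
        winner = max(range(len(remaining)), key=remaining.__getitem__)
        path.append(winner)
        remaining[winner] -= inhibition  # lasting inhibition of the winner
    return path


def worst_case_steps(n, m):
    """2 * log_m(n) time steps, with the number of levels rounded up."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= m
        levels += 1
    return 2 * levels
```

For the saliency values `[0.2, 0.9, 0.5, 0.7]` the focus visits locations 1, 3 and 2 in turn, and `worst_case_steps(10**6, 10)` reproduces the bound of 12 time steps quoted above.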

In summary, selective attention in the case we have considered requires three different stages
(see figure 1). First, a set of elementary features is computed in parallel across the visual
field and is represented in a set of cortical, topographical maps. Locations in visual space
that differ from their surround with respect to an elementary feature such as orientation,
color or motion are singled out in the corresponding map. These maps are combined
into the saliency map, which encodes the relative conspicuity of locations in the visual scene. Second,
the WTA mechanism, operating on this map, singles out the most conspicuous location.
Third, the properties of this selected location are routed to the central representation.
The WTA network then shifts automatically to the next most conspicuous location. The
visual system thus processes a scene in a sequential and automatic way by selectively inspecting
the information present in conspicuous locations. The mechanism sketched here could of
course be used not only for shifting the attentional focus but also for such visual
routines as tracking contours, counting objects or marking a specific location (Ullman,
1983).
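
The three stages summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the network described in the text: conspicuity is approximated by the deviation of each location from the map mean, the feature maps are combined by simple summation, the WTA stage is replaced by a global argmax, and the long-lasting inhibition is modeled as permanent. All function names (conspicuity, saliency_map, scan) and the toy feature maps are hypothetical.

```python
def conspicuity(fmap):
    """Deviation of each location from the map mean: a crude stand-in
    (assumption) for 'differs from its surround'."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    return [[abs(v - mean) for v in row] for row in fmap]


def saliency_map(feature_maps):
    """Stage 1: combine the per-feature conspicuity maps by summation (assumption)."""
    maps = [conspicuity(f) for f in feature_maps.values()]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]


def scan(feature_maps, n_shifts):
    """Visit the n_shifts most conspicuous locations in order of saliency."""
    saliency = saliency_map(feature_maps)
    locations = [(r, c) for r in range(len(saliency)) for c in range(len(saliency[0]))]
    inspected = []
    for _ in range(n_shifts):
        # Stage 2: winner-take-all, here simply a global argmax.
        r, c = max(locations, key=lambda rc: saliency[rc[0]][rc[1]])
        # Stage 3: route the winner's properties to the central representation.
        central = {name: fmap[r][c] for name, fmap in feature_maps.items()}
        inspected.append(((r, c), central))
        # Inhibit the selected unit so that the focus moves on.
        saliency[r][c] = float("-inf")
    return inspected


# Toy scene: one oriented element and one colored element on a blank background.
features = {
    "orientation": [[0, 0, 0], [0, 9, 0], [0, 0, 0]],
    "color": [[0, 0, 0], [0, 0, 0], [0, 0, 5]],
}
for location, properties in scan(features, 2):
    print(location, properties)
```

In this toy scene the focus first selects the oriented element at (1, 1) and then, once that unit is inhibited, the colored element at (2, 2).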


4.  Two  Rules  for  Shifting  the  Processing  Focus 


Should there be any systematic relationship between the current location and the next
location to be selected? If no such relationship is enforced, it would seem difficult to
visually inspect areas of the visual field without constantly shifting to conspicuous, but
distant, locations. Objects tend to occupy a compact region in space with similar properties
(color, motion, etc.). If the shifting apparatus is to scan different parts of a
given object automatically, it is useful to introduce a bias based on both spatial proximity and similarity.
Searching for an "interesting" target around a selected location would profit from a selection
mechanism biased towards nearby locations (what we call proximity preference). Scanning the
visual field for objects with a common identifying feature, for instance the color red, would
likewise be facilitated if locations with features similar to those of the presently selected location are
preferentially selected (similarity preference). Both mechanisms are related to phenomena
of perceptual grouping and "Gestalt effects", which occur as a function of object similarity
and spatial proximity (Wertheimer, 1923; Beck, 1967). The next two sections discuss these
rules in more detail.


4.1  Proximity  Preference 

It would seem advantageous from a computational point of view if the selection process
shifted preferentially to conspicuous locations in the neighborhood of the presently selected
location, instead of shifting to the global maximum independent of any locality considerations.
Inspecting, for instance, part of a visual image for the occurrence of some special feature
(or conjunction of features) could be performed much more efficiently if the search were
automatically limited to some neighborhood. The simplest way of implementing such a
proximity  preference  within  the  framework  of  the  WTA  mechanism  is  to  enhance  all  units 
in  the  neighborhood  of  the  currently  selected  unit  in  the  saliency  map.  Such  a  preference 
can  be  incorporated  in  a  straightforward  manner  into  the  network  described  earlier.  More 
specifically,  we  assume  that  the  output  of  the  WTA  mechanism  associated  with  the  presently 
attended  location  enhances  the  conspicuity  of  nearby  units  in  the  saliency  map  by  a  factor 
depending  on  the  distance  between  the  location  and  its  neighbors,  thereby  facilitating  shifts 
of  the  processing  focus  to  nearby  locations.  This  is  equivalent  to  postulating  the  existence 
of  an  attractive  potential  around  every  selected  location.  Some  experimental  evidence  for 
this type of interaction is provided by Engel (1971, 1974), whose results indicate that the
probability of detecting a target depends on its proximity to the location being
attended.
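
The proximity preference can be illustrated with a short sketch. The text only states that nearby units are enhanced by a factor depending on distance; the Gaussian fall-off used here, the parameters gain and sigma, and the function name apply_proximity_preference are all assumptions made for illustration. The selected unit itself is assumed to have been inhibited already.

```python
import math


def apply_proximity_preference(saliency, selected, gain=1.0, sigma=1.5):
    """Scale each unit by a factor that decays with distance from the selected
    location (assumed Gaussian fall-off; gain and sigma are hypothetical)."""
    r0, c0 = selected
    biased = []
    for r, row in enumerate(saliency):
        biased_row = []
        for c, s in enumerate(row):
            d2 = (r - r0) ** 2 + (c - c0) ** 2
            factor = 1.0 + gain * math.exp(-d2 / (2.0 * sigma ** 2))
            biased_row.append(s * factor)
        biased.append(biased_row)
    return biased


# One-dimensional example: the unit at column 1 was just selected and inhibited.
# A nearby target (column 2, saliency 4) competes with a distant, more
# conspicuous one (column 6, saliency 5).
saliency = [[0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 5.0]]
biased = apply_proximity_preference(saliency, selected=(0, 1))
print(saliency[0].index(max(saliency[0])))  # distant target wins without the bias
print(biased[0].index(max(biased[0])))      # nearby target wins with the bias
```

With these parameter values the nearby target overtakes the distant one; a smaller gain or a narrower sigma would leave the global maximum in control.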


4.2  Similarity  Preference 

On similar computational grounds one can justify the existence of a similarity preference.
We therefore postulate the existence of an interaction between similar elementary features:
the processing focus will preferentially shift to a location with the same or similar elementary
features as the presently selected location. Such a mechanism assumes interactions within
individual elementary feature maps, but not between them, and therefore it does not
require precise topographic mappings between the different elementary feature maps. The
interaction is activated by the output of the WTA network. This output (y* in figure 2)
increases the excitability, i.e. the conspicuity, of all units in a neighborhood of the selected
location within those elementary feature maps where the corresponding features have been
detected. If the currently selected location contains, for instance, a red, horizontal line, then
neighboring units in the feature maps for horizontal and red will be facilitated. The processing
focus will now preferentially shift to red and/or horizontal targets. The effect of the
similarity preference is opposite to the initial bias towards conspicuous locations. Locations
with similar properties initially inhibit each other. After a location has been selected, it
tends to enhance the conspicuity of nearby locations with similar properties. Although
the two processes have opposite effects, they can both be implemented without causing
undesirable contradiction or interference. The first occurs early on within the individual
maps and is implemented by local inhibition within the maps. The similarity preference can
be implemented by feedback from the output of the WTA network to all the different
feature maps. Finally, it would be expedient if the similarity preference for individual features
could be switched on or off voluntarily (for example, "look for red objects", i.e. facilitate the red
feature map), but it is unclear to what degree such control actually exists.
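
A minimal sketch of the similarity preference follows. It assumes that each feature map is a grid of non-negative responses, that a feature counts as "detected" at the selected location when its response exceeds a threshold, and that facilitation is a uniform multiplicative gain within a square neighborhood. The function name apply_similarity_preference and the parameters gain, radius and threshold are hypothetical and do not come from the text.

```python
def apply_similarity_preference(feature_maps, selected, gain=0.5, radius=2, threshold=0.0):
    """Feed the WTA output back to the feature maps: facilitate units near the
    selected location, but only in maps whose feature is present there."""
    r0, c0 = selected
    facilitated = {}
    for name, fmap in feature_maps.items():
        present = fmap[r0][c0] > threshold  # feature detected at the selected location?
        new_map = []
        for r, row in enumerate(fmap):
            new_row = []
            for c, v in enumerate(row):
                near = max(abs(r - r0), abs(c - c0)) <= radius
                new_row.append(v * (1.0 + gain) if present and near else v)
            new_map.append(new_row)
        facilitated[name] = new_map
    return facilitated


# The selected location (column 0) is red. A green target at column 1 and a
# red target at column 2 are equally conspicuous before the feedback.
maps = {
    "red": [[1.0, 0.0, 1.0, 0.0]],
    "green": [[0.0, 1.0, 0.0, 0.0]],
}
after = apply_similarity_preference(maps, selected=(0, 0))
print(after["red"])    # red units near the selected location are enhanced
print(after["green"])  # the green map is left unchanged
```

Because each map is facilitated independently, the sketch needs no mapping between the red and green maps, in line with the observation that the interaction acts within feature maps and not between them. Inhibition of the selected unit itself is assumed to be handled separately in the saliency map.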

Partial experimental support for this type of interaction comes from a recent study by
Geiger and Lettvin (1984), who investigated the influence of the attended location on lateral
masking. If a subject fixates a central point while a group of three letters is flashed onto
the screen at some distance from that point, the subject is usually unable to name
the middle letter of the group. However, if a copy of this letter is flashed at the fixation point,
the letter in the periphery transiently stands out against its neighbors in the string.


5.  Biological  Considerations 


What could the anatomical correlate of our selection mechanism be? The maps for the
different elementary features are most likely localized in areas within and beyond the striate
cortex, such as MT and MST for motion, and perhaps V4 for color. Does it then necessarily
follow that the saliency map, which combines aspects from the different elementary feature
maps, must be located beyond these areas? One intriguing possibility is that the saliency
map resides either at the level of the lateral geniculate nucleus (LGN) or in the striate
cortex, V1 (see figure 5). The LGN in the cat and the striate cortex in primates represent the
last major station along the retino-geniculo-cortical pathway before the visual information is
dispersed to different regions. The Y-pathway projects to the striate and extrastriate areas
V1, V2 and V3 in the cat, but predominantly to V1 in the monkey. The X-pathway behaves
similarly in both cat and monkey, projecting predominantly to V1. The W-pathway, much
less explored, in addition to sending fibers to V1, V2 and V3 probably also innervates the
medial Clare-Bishop area and area 21a (Graybiel & Berson, 1981; Sherman, 1984). One
puzzling feature of the LGN is the existence of an extensive reciprocal projection from
the cortex onto the LGN (Macchi & Rinvik, 1976). This connection observes the general
principle that for every geniculocortical projection there is a corresponding corticothalamic
pathway. Although little information on the number of fibers involved in this back-projection
is available, estimates suggest that it is at least as massive as the forward projection, and
perhaps considerably stronger: Gilbert and Kelly (1975) estimate that about half of all cells
in layer VI of the cat striate cortex send their axons to the LGN.

Figure 5. A biological implementation of the selection process. The saliency map may be
localized either within the lateral geniculate nucleus (LGN) or within the striate cortex (V1). The
back-projections from the cortical maps for different properties (for instance P1 and P2)
solve the spatial register problem. The WTA network selects the most active unit in the saliency
map, subsequently routing the information corresponding to this selected location into the central
representation. Interestingly, Crick recently proposed (1984; see also Yingling & Skinner, 1977)
that the attentional searchlight is controlled by the thalamic reticular nucleus, a layered structure
surrounding the thalamus. It receives extensive feedback from the visual cortex and projects onto
the principal relay cells in the LGN.


These strong reciprocal connections could be used to solve the spatial register problem in
the manner suggested in figure 5. The visual environment is encoded at the level of the
LGN or V1 in neurons having circularly symmetric receptive fields. Subsequently, different
properties such as color, motion, disparity etc. are processed, analyzed and represented in
different regions of the cortex. These regions then project back to the LGN (via V1). If, for
instance, a single location stands out among all others in the area computing color, this
location will enhance the corresponding location in the LGN. Similarly, the different visual
maps all feed back into the saliency map, providing it with a measure of the strength and
importance of the different features. The WTA network now finds the most active unit in
the saliency map.

This  arrangement  provides  a  mechanism  for  spatial  register,  since  all  the  information 
pertaining  to  the  selected  location  is  transmitted  together  to  the  central  representation.  A 
notable  limitation  of  this  mechanism  is  that  spatial  register  is  obtained  for  one  location  at  a 
time,  a  property  that  is  consistent  with  psychophysical  evidence  (Treisman  &  Gelade,  1980). 
In  terms  of  connections  among  different  visual  areas,  this  arrangement  has  two  interesting 
properties.  First,  it  requires  an  extensive  topographic  projection  from  the  cortex  back  to 
the  LGN  (Tsumoto,  Creutzfeldt  &  Legendy,  1978).  Second,  it  does  not  require  precise 
topographic  reciprocal  interconnections  among  all  the  different  visual  maps. 